Large volumes of text, including documents, articles, and reports, are now produced faster than they can be read manually. Understanding this material takes significant time, and finding relationships between the concepts it contains is harder still. This paper presents a framework that converts raw text or files such as PDF and DOC into a structured knowledge graph by combining deep learning with Graph Neural Networks. The system uses the REBEL model, which is based on the BART architecture, to perform joint extraction of subject–relation–object triplets directly from text in a single step. Because entities and relations are not extracted in separate stages, this approach reduces cascaded errors. The extracted triplets are used to build a knowledge graph in which nodes represent entities and edges represent the relationships between them, and a graph-based storage layer manages and organizes these nodes and edges. To refine the extracted information, Relational Graph Convolutional Networks (R-GCN) capture both structural and semantic relationships in the data, which lets the system better model the context and connections between entities and supports more effective information retrieval. The web interface is built with Flask, a lightweight Python web framework. Users upload documents or enter text, and the system generates a dynamic graph in which identified concepts are mapped to nodes (entities) and their interactions to directed edges (relationships). The interface also includes a data view that displays the extracted triplets, allowing users to verify specific subject–relation–object triplets and to run queries against their uploaded text. The proposed framework demonstrates how integrating natural language processing with graph-based learning can provide a scalable solution for transforming unstructured data into meaningful knowledge.
Introduction
This paper describes the development of an intelligent Knowledge Graph Extraction and Reasoning System that converts large amounts of unstructured text from research papers and articles into structured knowledge graphs. Traditional search engines can retrieve documents but do not explicitly show how the concepts in them are related. Knowledge graphs address this by representing entities and the relationships between them explicitly, but building them by hand is difficult and time-consuming.
Earlier systems mainly treated Named Entity Recognition (NER) and relation extraction as separate stages, using machine learning methods such as Support Vector Machines (SVM). In such pipelines, mistakes made during entity recognition propagate into relation extraction, causing cascading errors. Transformer-based models such as BERT improved Natural Language Processing by capturing contextual meaning more effectively. More recent approaches such as REBEL combine entity and relation extraction into a single sequence-to-sequence step, reducing these errors and improving efficiency.
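As an illustration of what joint extraction produces, the sketch below parses one decoded REBEL output sequence into subject–relation–object triplets. It assumes the linearized layout documented for the public REBEL checkpoints (`<triplet> subject <subj> object <obj> relation`), which this paper does not itself specify. The function name `parse_rebel_output` and the dictionary keys are hypothetical, and model loading and generation are omitted.

```python
from typing import Dict, List

SPECIAL_TOKENS = ("<s>", "</s>", "<pad>")   # sequence markers to discard
MARKERS = ("<triplet>", "<subj>", "<obj>")  # assumed REBEL field markers


def parse_rebel_output(decoded: str) -> List[Dict[str, str]]:
    """Parse one decoded sequence into subject/relation/object dicts.

    Assumed layout (an assumption, not taken from this paper):
        <triplet> SUBJECT <subj> OBJECT <obj> RELATION
                  [<subj> OBJECT <obj> RELATION ...]
    """
    for tok in SPECIAL_TOKENS:
        decoded = decoded.replace(tok, " ")
    for marker in MARKERS:
        # Pad markers so they split cleanly even when glued to a word.
        decoded = decoded.replace(marker, f" {marker} ")

    triplets: List[Dict[str, str]] = []
    subject: List[str] = []
    obj: List[str] = []
    relation: List[str] = []
    state = None  # field that the current word belongs to

    def flush() -> None:
        # Emit a triplet only when all three fields are filled.
        if subject and obj and relation:
            triplets.append({
                "subject": " ".join(subject),
                "relation": " ".join(relation),
                "object": " ".join(obj),
            })

    for word in decoded.split():
        if word == "<triplet>":    # a new subject starts
            flush()
            subject, obj, relation = [], [], []
            state = "subject"
        elif word == "<subj>":     # an object for the current subject follows
            flush()
            obj, relation = [], []
            state = "object"
        elif word == "<obj>":      # the relation label follows
            relation = []
            state = "relation"
        elif state == "subject":
            subject.append(word)
        elif state == "object":
            obj.append(word)
        elif state == "relation":
            relation.append(word)
    flush()
    return triplets


example = ("<s><triplet> Marie Curie <subj> Warsaw <obj> place of birth "
           "<subj> Nobel Prize in Physics <obj> award received</s>")
for t in parse_rebel_output(example):
    print(t["subject"], "|", t["relation"], "|", t["object"])
```

Because one subject can be followed by several object and relation pairs, a single generated sequence may yield multiple triplets that share the same subject, which is why the parser keeps the subject until the next `<triplet>` marker.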
The proposed framework integrates REBEL with Relational Graph Convolutional Networks (R-GCNs) to create structured multi-relational knowledge graphs. The system extracts subject–relation–object triplets from text, converts them into graph structures, and uses graph-based reasoning to identify hidden relationships and semantic connections. Advanced graph learning techniques such as Graph Transformer Networks and dynamic graph learning further improve adaptability and accuracy.
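The relational message passing performed by an R-GCN layer can be sketched in a few lines. The toy layer below follows the standard R-GCN update, h_i' = ReLU(W_0 h_i + sum over relations r and neighbours j in N_i^r of (1 / |N_i^r|) W_r h_j), written in plain Python rather than PyTorch so that it runs without dependencies. The function names, the two-dimensional feature vectors, and the weight matrices are illustrative assumptions and not the trained model used in the system. Inverse relations and weight regularization are left out.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def matvec(w: Matrix, x: Vector) -> Vector:
    """Multiply matrix w (one row per output dimension) by vector x."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in w]


def rgcn_layer(
    features: Dict[str, Vector],
    edges: List[Tuple[str, str, str]],   # (source, relation, target)
    rel_weights: Dict[str, Matrix],      # one weight matrix W_r per relation
    self_weight: Matrix,                 # self-loop weight matrix W_0
) -> Dict[str, Vector]:
    # Group the incoming neighbours of every node by relation type.
    incoming: Dict[str, Dict[str, List[str]]] = defaultdict(
        lambda: defaultdict(list)
    )
    for src, rel, dst in edges:
        incoming[dst][rel].append(src)

    updated: Dict[str, Vector] = {}
    for node, h in features.items():
        total = matvec(self_weight, h)            # self-loop term W_0 h_i
        for rel, neighbours in incoming[node].items():
            norm = 1.0 / len(neighbours)          # 1 / |N_i^r|
            for nb in neighbours:
                msg = matvec(rel_weights[rel], features[nb])
                total = [t + norm * m for t, m in zip(total, msg)]
        updated[node] = [max(0.0, t) for t in total]   # ReLU
    return updated


features = {"Marie Curie": [1.0, 0.0], "Warsaw": [0.0, 1.0], "Poland": [1.0, 1.0]}
edges = [("Marie Curie", "place of birth", "Warsaw"),
         ("Poland", "contains", "Warsaw")]
rel_weights = {"place of birth": [[2.0, 0.0], [0.0, 2.0]],
               "contains": [[0.0, 1.0], [1.0, 0.0]]}
identity = [[1.0, 0.0], [0.0, 1.0]]

out = rgcn_layer(features, edges, rel_weights, identity)
print(out["Warsaw"])  # [3.0, 2.0]
```

Each relation type has its own weight matrix, so the same neighbour contributes differently depending on how it is connected. This is what allows the layer to use the edge labels of a multi-relational knowledge graph rather than only its connectivity.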
The architecture consists of four layers: data extraction, knowledge management, structural reasoning, and visualization. Flask is used for the web platform, PyTorch powers the AI inference, NetworkX manages graph structures, Pickle stores graph data, and PyVis provides interactive graph visualization. Users can upload documents, explore relationships interactively, perform multi-hop querying, and export graph data.
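The knowledge management and multi-hop querying steps can be sketched as follows. The system itself uses NetworkX for the graph structure. The example below is a dependency-free stand-in that stores the graph as an adjacency dictionary, round-trips it through pickle, and answers a multi-hop query with breadth-first search. The names `build_graph` and `find_path`, the `max_hops` parameter, and the sample triplets are hypothetical and chosen only for illustration.

```python
import pickle
from collections import deque
from typing import Dict, List, Optional, Tuple

Triplet = Tuple[str, str, str]             # (subject, relation, object)
Graph = Dict[str, List[Tuple[str, str]]]   # node -> [(relation, neighbour)]


def build_graph(triplets: List[Triplet]) -> Graph:
    """Turn triplets into a directed, edge-labelled adjacency dictionary."""
    graph: Graph = {}
    for subj, rel, obj in triplets:
        graph.setdefault(subj, []).append((rel, obj))
        graph.setdefault(obj, [])   # nodes without outgoing edges exist too
    return graph


def find_path(graph: Graph, start: str, goal: str,
              max_hops: int = 3) -> Optional[List[Triplet]]:
    """Shortest directed path of at most max_hops edges, or None."""
    if start not in graph or goal not in graph:
        return None
    queue = deque([(start, [])])
    visited = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        if len(path) == max_hops:   # do not expand beyond the hop limit
            continue
        for rel, nxt in graph[node]:
            if nxt not in visited:
                visited.add(nxt)
                queue.append((nxt, path + [(node, rel, nxt)]))
    return None


triplets = [
    ("Marie Curie", "place of birth", "Warsaw"),
    ("Warsaw", "country", "Poland"),
    ("Marie Curie", "award received", "Nobel Prize in Physics"),
]
graph = build_graph(triplets)

# Persist and reload the graph, as the storage layer would do with a file.
restored = pickle.loads(pickle.dumps(graph))

for hop in find_path(restored, "Marie Curie", "Poland") or []:
    print(" -> ".join(hop))
```

The two-hop answer links "Marie Curie" to "Poland" through "Warsaw", a connection that no single extracted triplet states directly. Note that pickle should only be used to load files the application itself wrote, since unpickling untrusted data can execute arbitrary code.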
Experimental evaluation on datasets such as CoNLL-2003 and SemEval-2010 showed that the proposed REBEL-based system outperformed the TF-IDF and BERT-SVM baselines, achieving 52.3% accuracy and an F1-score of 0.52. Graph reasoning was evaluated with Mean Reciprocal Rank (MRR), where the system reached 0.4759, and the results can be explored through interactive visualization. Overall, the framework provides an efficient and scalable solution for automatic knowledge extraction, graph construction, and semantic reasoning from unstructured text data.
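For reference, the two kinds of metric reported above can be computed as in the sketch below. It assumes exact-match scoring of triplets for precision, recall, and F1, and a 1-based rank of the first correct answer per query for MRR. The paper does not state its exact scoring protocol, so these conventions, the function names, and the sample values are assumptions.

```python
from typing import Iterable, Optional, Set, Tuple

Triplet = Tuple[str, str, str]


def precision_recall_f1(predicted: Set[Triplet],
                        gold: Set[Triplet]) -> Tuple[float, float, float]:
    """Exact-match precision, recall and F1 over sets of triplets."""
    true_pos = len(predicted & gold)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def mean_reciprocal_rank(ranks: Iterable[Optional[int]]) -> float:
    """Mean of 1/rank; a query with no correct answer (None) contributes 0."""
    ranks = list(ranks)
    if not ranks:
        return 0.0
    return sum(1.0 / r for r in ranks if r) / len(ranks)


gold = {("A", "r1", "B"), ("B", "r2", "C"), ("C", "r3", "D"), ("D", "r4", "E")}
predicted = {("A", "r1", "B"), ("B", "r2", "C"), ("A", "r9", "E")}

p, r, f1 = precision_recall_f1(predicted, gold)
print(round(p, 3), round(r, 3), round(f1, 3))   # 0.667 0.5 0.571
print(mean_reciprocal_rank([1, 2, None, 4]))     # 0.4375
```

Under this convention an MRR close to 0.5 means that, on average, the first correct answer appears around the second position of the ranked list.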
Conclusion
In this project, we developed a modular pipeline that converts unstructured, messy text into a structured knowledge graph. A key design decision was to use the REBEL model for joint extraction, which avoids the cascading errors common in traditional pipeline-based NLP systems. The results show that the approach is technically feasible: the system achieved 52.3% relational accuracy and was able to extract relevant information from complex, unstructured text. We implemented a logic layer using NetworkX to manage how entities and relationships interact and connect. On graph reasoning, the system achieved a Mean Reciprocal Rank (MRR) of 0.4759.
To make these backend processes accessible, we developed a Flask-based web interface that allows users to interact with the data and the graph. The system features a querying engine through which users can ask specific questions about uploaded text or documents to uncover hidden paths and relationships. The interface also provides a transparent view of the triplet database, so users can easily inspect the extracted facts and download them locally as a CSV file for further use. Ultimately, the system helps convert large amounts of text into a structured knowledge graph.
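The CSV download can be illustrated with a short sketch that serializes extracted triplets using the standard `csv` module. The function name `triplets_to_csv` and the column headers are assumptions, since the exact export format of the system is not specified here. In a Flask view, the returned string could be sent back as a response with a `text/csv` content type.

```python
import csv
import io
from typing import Iterable, Tuple


def triplets_to_csv(triplets: Iterable[Tuple[str, str, str]]) -> str:
    """Serialize (subject, relation, object) triplets as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["subject", "relation", "object"])  # assumed header names
    writer.writerows(triplets)
    return buffer.getvalue()


text = triplets_to_csv([
    ("Marie Curie", "place of birth", "Warsaw"),
    ("Curie, Marie", "award received", "Nobel Prize in Physics"),
])
print(text)
```

Using `csv.writer` rather than joining fields with commas matters because entity names can themselves contain commas or quotes, which the writer escapes automatically.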
References
[1] A. A. Shahid and M. T. Afzal, "A review on knowledge and information extraction from PDF documents," Frontiers in Artificial Intelligence, vol. 8, 2025.
[2] S. Polat, I. Tiddi, and P. Groth, "A review on scientific knowledge extraction using large language models," arXiv preprint arXiv:2412.03531, 2024.
[3] E. F. Tjong Kim Sang and F. De Meulder, "Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition," in Proc. 7th Conf. Natural Language Learning at HLT-NAACL 2003, 2003, pp. 142–147.
[4] C. Cortes and V. Vapnik, "Support-vector networks," Machine Learning, vol. 20, no. 3, pp. 273–297, Sep. 1995.
[5] I. Hendrickx et al., "SemEval-2010 Task 8: Multi-way classification of semantic relations between pairs of nominals," in Proc. 5th Int. Workshop Semantic Evaluation, 2010, pp. 33–38.
[6] A. Patel and L. Wang, "Transformer-based approaches for semantic relation classification," Computer Science Review, vol. 51, 2024.
[7] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," in Proc. NAACL-HLT, 2019, pp. 4171–4186.
[8] J. Lee and S. Kim, "Hybrid models combining transformers and GNNs for relation extraction," Journal of Artificial Intelligence Research, vol. 12, pp. 45–58, 2024.
[9] Y. Yao, C. Mao, and Y. Luo, "Graph convolutional networks for text classification," in Proc. AAAI Conf. Artificial Intelligence, 2019, pp. 7370–7377.
[10] M. Schlichtkrull et al., "Modeling relational data with graph convolutional networks," in Proc. ESWC, 2018, pp. 593–607.
[11] Y. Zhang et al., "Graph transformer networks for heterogeneous knowledge graph learning," IEEE Trans. Knowl. Data Eng., vol. 37, no. 2, 2025.
[12] H. Liu and J. Zhao, "Dynamic graph learning for knowledge extraction using attention mechanisms," Frontiers of Computer Science, vol. 19, no. 1, 2025.